 short video clips


REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Neural Information Processing Systems

Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to fill each quote placeholder with the video clip that best supports the narrative, selected from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
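The two-stage idea in this abstract (a script with quote placeholders, then retrieval to fill them) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `[QUOTE]` marker, the `fill_quotes` helper, the clip dictionaries, and the bag-of-words cosine similarity, which stands in for the paper's learned retrieval model. It is not the REGen implementation.

```python
import math
import re
from collections import Counter

# Hypothetical marker; the abstract does not specify the placeholder format.
PLACEHOLDER = "[QUOTE]"


def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for a learned multimodal encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fill_quotes(script_lines, candidates):
    """Replace each placeholder line with the unused candidate clip whose
    transcript is most similar to the nearest preceding narration line."""
    remaining = list(candidates)
    output = []
    for i, line in enumerate(script_lines):
        if line.strip() != PLACEHOLDER or not remaining:
            output.append(line)
            continue
        # Context: the closest narration line before this placeholder.
        context = next(
            (l for l in reversed(script_lines[:i]) if l.strip() != PLACEHOLDER), ""
        )
        ctx_vec = bag_of_words(context)
        best = max(remaining, key=lambda c: cosine(ctx_vec, bag_of_words(c["transcript"])))
        remaining.remove(best)  # each clip is quoted at most once
        output.append(f'<clip {best["id"]}> "{best["transcript"]}"')
    return output


# Stage 1 output (assumed): a narration script with quote placeholders.
script = [
    "The glacier has been retreating for decades.",
    "[QUOTE]",
    "Local farmers now face an uncertain harvest.",
    "[QUOTE]",
]
# Pool of candidate quotable clips, represented here only by their transcripts.
candidates = [
    {"id": "c1", "transcript": "Our harvest depends on the meltwater, and the farmers are worried."},
    {"id": "c2", "transcript": "I measured the glacier every year and watched it retreating."},
    {"id": "c3", "transcript": "The city council met on Tuesday."},
]

for line in fill_quotes(script, candidates):
    print(line)
```

In this sketch the first placeholder picks up the glacier clip and the second the harvest clip. A real system would score candidates with audio-visual and textual features instead of word overlap.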


Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues

arXiv.org Artificial Intelligence

Pedestrian intention prediction is essential for autonomous driving in complex urban environments. Conventional approaches depend on supervised learning over frame sequences and require extensive retraining to adapt to new scenarios. Here, we introduce BF-PIP (Beyond Frames Pedestrian Intention Prediction), a zero-shot approach built upon Gemini 2.5 Pro. It infers crossing intentions directly from short, continuous video clips enriched with structured JAAD metadata. In contrast to GPT-4V-based methods that operate on discrete frames, BF-PIP processes uninterrupted temporal clips. It also incorporates bounding-box annotations and ego-vehicle speed via specialized multimodal prompts. Without any additional training, BF-PIP achieves 73% prediction accuracy, outperforming a GPT-4V baseline by 18%. These findings illustrate that combining temporal video inputs with contextual cues enhances spatiotemporal perception and improves intent inference under ambiguous conditions. This approach paves the way for agile, retraining-free perception modules in intelligent transportation systems.


Meta unveils artificial intelligence-generated video

#artificialintelligence

Meta announced that it was taking artificial intelligence-generated art to the next level by allowing users to create short video clips by just typing in a string of descriptive statements. Meta's AI division announced Thursday that it was unveiling Make-A-Video, an AI system that allows users to turn text prompts into short video clips of whatever was described. "Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content," Meta said in a post describing the new technology. "With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors, characters, and landscapes."


Using synthetic data for deep learning video recognition

#artificialintelligence

In recent years, deep learning has completely revolutionized the fields of computer vision, speech recognition and natural language processing. Despite breakthroughs in all three fields, one common barrier to training neural networks to solve real-world problems remains the amount of labeled training data required. In some domains, like video understanding, gathering real-world data can be prohibitively expensive and time-consuming in the absence of innovative solutions. At TwentyBN, we solved this problem by building an in-house data factory for generating high-quality videos for neural networks to learn about the real world. We instruct crowd workers to record short video clips based on carefully predefined and highly specific descriptions.


DeepMind AI Teaches Itself About the World by Watching Videos

#artificialintelligence

Researchers at Google's DeepMind unit have developed an artificial intelligence (AI) system that teaches itself to recognize a range of visual and audio concepts by watching short video clips. For example, the new system can understand the concept of lawn mowing, even when it has not learned the words to describe what it is hearing or seeing. "We want to build machines that continuously learn about their environment in an autonomous manner," says University of California, Berkeley researcher Pulkit Agrawal. He notes the DeepMind project brings the field one step closer to the goal of creating AI that can teach itself by watching and listening to the world around it.


Google Motion Stills turns your Live Photos into GIFs: Free iOS app now makes it easier to create looping animations

Daily Mail - Science & tech

Despite their popularity, creating GIFs can still be a tricky process. But a new app from Google, called 'Motion Stills', will allow you to easily create the moving images in just a few clicks. The app takes Live Photos, several frames automatically captured before and after you hit the camera app's shutter button, and turns them into GIFs or short video clips. 'We use our video stabilization technology to freeze the background into a still photo or create sweeping cinematic pans,' Ken Conley and Matthias Grundmann from the Google Research Machine Perception team said in a blog post. 'The resulting looping GIFs and movies come alive, and can easily be shared via messaging or on social media.'